
fix(opensearch): Update should clear projects and personas when they are empty#8845

Merged
acaprau merged 12 commits into main from andrei/260226/3/opensearch/fix-update-projects-and-personas-empty
Mar 10, 2026

Conversation

@acaprau
Contributor

@acaprau acaprau commented Feb 27, 2026

Description

Currently, empty projects and personas are treated as None (i.e., no update required), when an empty list should actually clear the respective fields from the chunk.

This is a followup to #8680 (comment).

How Has This Been Tested?

I went through the effort of adding external dependency unit tests for this function of the old document index interface. Hopefully these fixtures can be used to easily add additional tests when needed. This took me more time and effort than I wanted, please clap.

Additional Options

  • [Optional] Please cherry-pick this PR to the latest release version.
  • [Optional] Override Linear Check

Summary by cubic

Fixes updates so sending empty user_projects/personas clears those fields in both OpenSearch and Vespa. Project/persona filters no longer match cleared chunks after clearing.

  • Bug Fixes
    • In update_single for both backends, treat empty lists as “clear”; send empty sets to clear fields and use None when fields are omitted (no-op).
    • Added external dependency tests for the old index interface across OpenSearch and Vespa, covering indexing, clearing, filtered retrieval, and Vespa schema readiness; minor test fixture cleanup to avoid module-scoped leakage.

Written for commit 1388cdd. Summary will update on new commits.

@acaprau acaprau requested a review from a team as a code owner February 27, 2026 01:25
@greptile-apps
Contributor

greptile-apps bot commented Feb 27, 2026

Greptile Summary

This PR fixes a bug in OpenSearchOldDocumentIndex.update_single where empty user_projects and personas lists were being treated as None (no-op) instead of clearing those fields on the stored chunks. The fix changes the truthiness check (if user_fields.user_projects) to an explicit if user_fields.user_projects is not None check, consistent with the Vespa backend which already handled this correctly. An external-dependency integration test is added to validate the clear-on-empty behaviour end-to-end against both OpenSearch and Vespa.

Key changes:

  • opensearch_document_index.py: The only logic fix — project_ids and persona_ids conditions now use is not None so that an empty list is converted to an empty set() and propagated to MetadataUpdateRequest, which in turn writes an empty list to OpenSearch and correctly clears those fields.
  • vespa/index.py: No logic change — the existing is not None checks were already correct; clarifying comments are added for consistency.
  • search.py: Comment-only reformatting.
  • test_document_index_old.py: New integration test file that indexes chunks with a non-empty project/persona, calls update_single with empty lists, and asserts the chunks are no longer reachable via project/persona-scoped filters. One TODO on line 287 is missing the required name/ticket annotation (e.g. TODO(ENG-XXXX)(name):).
  • test_opensearch_migration_tasks.py: Adds a cross-reference TODO comment pointing to the fixture consolidation ticket.

Confidence Score: 5/5

  • This PR is safe to merge; it is a minimal, well-scoped bug fix with an accompanying integration test.
  • The logic change is a single-line predicate correction (`user_fields.user_projects` → `user_fields.user_projects is not None`) that aligns OpenSearch behaviour with the already-correct Vespa implementation. The downstream handling of an empty set in MetadataUpdateRequest is confirmed to be correct (the is not None guard at the update layer passes the empty list through to OpenSearch). The only issue found is a minor TODO format violation in the test file.
  • No files require special attention beyond the minor TODO annotation fix in test_document_index_old.py.

Important Files Changed

| Filename | Overview |
| --- | --- |
| `backend/onyx/document_index/opensearch/opensearch_document_index.py` | Core bug fix: changes the condition for project_ids/persona_ids in update_single from a truthiness check (treating `[]` as None/no-op) to an explicit `is not None` check, correctly allowing empty lists to clear those fields in OpenSearch. |
| `backend/onyx/document_index/vespa/index.py` | Documentation-only change: adds clarifying comments to the already-correct `is not None` checks for project_ids/persona_ids; no logic changes needed here. |
| `backend/onyx/document_index/opensearch/search.py` | Comment reformatting only, no logic changes; knowledge scope comments are reflowed to fit within line length limits. |
| `backend/tests/external_dependency_unit/document_index/test_document_index_old.py` | New external-dependency test file covering the clear-on-empty fix for both OpenSearch and Vespa; one TODO at line 287 is missing the required name/ticket annotation. Known limitations (time.sleep, hardcoded previous chunk count) are explicitly documented with comments. |
| `backend/tests/external_dependency_unit/opensearch_migration/test_opensearch_migration_tasks.py` | Trivial change: adds a TODO(ENG-3764)(andrei) cross-reference comment to the fixture duplication note; no logic changes. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["update_single(doc_id, user_fields)"] --> B{user_fields is None?}
    B -- Yes --> C["project_ids = None\npersona_ids = None\n(no-op for both fields)"]
    B -- No --> D{user_fields.user_projects\nis None?}
    D -- "Yes (None)" --> E["project_ids = None\n(no-op)"]
    D -- "No ([] or [1,2,...])" --> F["project_ids = set(user_projects)\n(e.g. set() clears the field)"]
    E --> G{user_fields.personas\nis None?}
    F --> G
    G -- "Yes (None)" --> H["persona_ids = None\n(no-op)"]
    G -- "No ([] or [1,2,...])" --> I["persona_ids = set(personas)\n(e.g. set() clears the field)"]
    H --> J["MetadataUpdateRequest\n(None fields are skipped;\nempty sets clear the field)"]
    I --> J
    C --> J
    J --> K["OpenSearch / Vespa\nupdate chunks"]
```

Comments Outside Diff (1)

  1. backend/tests/external_dependency_unit/document_index/test_document_index_old.py, lines 287-288

    TODO missing name or ticket

    The TODO on this line doesn't include an associated name or ticket number, which is required for all TODOs in this codebase. It should follow the style TODO(name): ... or TODO(1234): .... The other TODOs in this same file (e.g., TODO(ENG-3764)(andrei):, TODO(ENG-3765)(andrei):) already conform to this pattern.

    Rule Used: Whenever a TODO is added, there must always be an ... (source)

Last reviewed commit: 1388cdd

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 4 files

Confidence score: 4/5

  • There’s a moderate risk of test failures: module-scoped fixtures in backend/tests/external_dependency_unit/document_index/test_document_index_old.py depend on function-scoped tenant_context, which can trigger pytest ScopeMismatch errors.
  • This is limited to the test suite and doesn’t imply a runtime regression, so the overall merge risk remains low.
  • Pay close attention to backend/tests/external_dependency_unit/document_index/test_document_index_old.py - fix fixture scoping to avoid ScopeMismatch errors.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/tests/external_dependency_unit/document_index/test_document_index_old.py">

<violation number="1" location="backend/tests/external_dependency_unit/document_index/test_document_index_old.py:60">
P2: Module-scoped fixtures cannot depend on a function-scoped fixture. `tenant_context` should be module-scoped (or removed from those fixtures) to avoid pytest `ScopeMismatch` errors.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@github-actions
Contributor

github-actions bot commented Feb 27, 2026

🖼️ Visual Regression Report

| Project | Changed | Added | Removed | Unchanged | Report |
| --- | --- | --- | --- | --- | --- |
| admin | 0 | 0 | 0 | 119 | ✅ No changes |
| exclusive | 0 | 0 | 0 | 8 | ✅ No changes |

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

2 issues found across 6 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/scripts/restart_containers.sh">

<violation number="1" location="backend/scripts/restart_containers.sh:32">
P2: The new `--keep-opensearch-data` flag is parsed but never removed from positional args, so passing the flag without volumes (or before them) turns it into the Vespa volume value and breaks the docker run command. Parse args into a filtered list or shift after handling the flag before assigning volumes.</violation>
</file>
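The suggested fix for the flag-parsing bug can be sketched as a small filtering loop. The `--keep-opensearch-data` flag comes from the review comment; the volume layout and the sample volume name are assumptions:

```shell
#!/bin/sh
# Sketch of the suggested fix: strip known flags out of the argument list
# before assigning positional volume names, so the flag can appear in any
# position without being mistaken for a volume. Names are hypothetical.
set -- --keep-opensearch-data onyx_vespa_data   # simulate a sample invocation

KEEP_OPENSEARCH_DATA=false
POSITIONAL=""
for arg in "$@"; do
  case "$arg" in
    --keep-opensearch-data) KEEP_OPENSEARCH_DATA=true ;;
    *) POSITIONAL="$POSITIONAL $arg" ;;
  esac
done
# Reset the positional parameters to the flag-stripped args only.
# shellcheck disable=SC2086
set -- $POSITIONAL

VESPA_VOLUME="${1:-}"   # now never receives the flag
echo "keep_opensearch_data=$KEEP_OPENSEARCH_DATA vespa_volume=$VESPA_VOLUME"
```

With this, the flag is consumed before volumes are assigned, regardless of argument order.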

<file name="backend/onyx/document_index/vespa/indexing_utils.py">

<violation number="1" location="backend/onyx/document_index/vespa/indexing_utils.py:282">
P1: Avoid logging raw `e.response.text` for non-retryable HTTP errors; it can expose sensitive response content in logs. Log only safe metadata such as status code and document id.

(Based on your team's feedback about avoiding raw exception/URL token leakage in logs.) [FEEDBACK_USED]</violation>
</file>
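The safer-logging suggestion above can be sketched as follows. The exception/response shape is a stand-in for the HTTP client error the real indexing code handles; only the "log metadata, not the body" pattern is the point:

```python
# Sketch of the safer logging suggested above: log status code and doc id,
# never the raw response body, which may contain sensitive content.
# ResponseStandIn is a hypothetical stand-in for a real HTTP response.
import logging

logger = logging.getLogger("vespa.indexing")


class ResponseStandIn:
    def __init__(self, status_code: int, text: str):
        self.status_code = status_code
        self.text = text  # potentially sensitive; intentionally never logged


def format_index_error(doc_id: str, response: ResponseStandIn) -> str:
    # Only safe metadata goes into the log line.
    msg = f"Non-retryable error indexing document {doc_id}: HTTP {response.status_code}"
    logger.error(msg)
    return msg


print(format_index_error("doc-123", ResponseStandIn(400, "secret-token-abc")))
```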

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@github-actions
Contributor

github-actions bot commented Mar 5, 2026

Preview Deployment

| Status | Preview | Commit | Updated |
| --- | --- | --- | --- |
|  | https://onyx-preview-1cuvqmi3k-danswer.vercel.app | b7dcbad | 2026-03-10 20:26:11 UTC |

Comment on lines +273 to +280
```python
@pytest.fixture(scope="function")
def index_batch_params() -> Generator[IndexBatchParams, None, None]:
    yield IndexBatchParams(
        doc_id_to_previous_chunk_cnt={"test_doc": 0},
        doc_id_to_new_chunk_cnt={"test_doc": 5},
        tenant_id=get_current_tenant_id(),
        large_chunks_enabled=False,
    )
```
Contributor

Hardcoded doc_id_to_previous_chunk_cnt will break future tests sharing module-scoped backends

doc_id_to_previous_chunk_cnt={"test_doc": 0} is hardcoded to 0, which is only correct on the very first index call. The document_indices fixture is scope="module", meaning the same OpenSearch and Vespa backends persist across all test functions in this module. When a second test function uses this fixture and calls document_index.index(...), the backend already has 5 chunks for "test_doc" from the previous test run, but the batch params still claim 0 prior chunks exist. This can lead to orphaned/duplicate chunks that make subsequent assertions incorrect.

The PR description explicitly states these fixtures should be reused for additional tests, so this will become a real problem quickly. Consider making the previous chunk count dynamic or resetting it to the correct value based on backend state, or using a per-test unique doc_id.
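The per-test unique doc_id option can be sketched like this. Plain dicts stand in for the real `IndexBatchParams`; the naming scheme is an assumption:

```python
# Sketch of the per-test unique doc_id suggestion. A fresh doc id per test
# keeps doc_id_to_previous_chunk_cnt=0 honest even when module-scoped
# backends persist chunks across tests. Dicts stand in for IndexBatchParams.
import uuid


def make_index_batch_params(new_chunk_cnt: int = 5) -> dict:
    doc_id = f"test_doc_{uuid.uuid4().hex[:8]}"  # unique per call, so per test
    return {
        "doc_id_to_previous_chunk_cnt": {doc_id: 0},  # fresh doc: 0 is correct
        "doc_id_to_new_chunk_cnt": {doc_id: new_chunk_cnt},
    }
```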


Contributor Author

good feedback, ill add a comment in the code for future test authors about this but ive spent so long trying to get this test working that i dont care abt this in this pr.

Contributor Author

or i can just clean up the index between test cases. either way ill do this later.

```python
    project_id=1,
    persona_id=2,
)
time.sleep(1)
```
Contributor

Fixed-duration time.sleep makes tests fragile in slow CI environments

Using time.sleep(1) for Vespa and OpenSearch indexing propagation is a known source of test flakiness—on a loaded CI worker 1 second may not be enough for the write to be queryable. The vespa_document_index fixture already demonstrates a robust polling pattern (lines 145–154) that retries for up to 60 seconds. Applying a similar retry-until-consistent approach here would make these assertions more reliable:

```python
import tenacity

@tenacity.retry(
    stop=tenacity.stop_after_delay(10),
    wait=tenacity.wait_fixed(0.5),
    reraise=True,
)
def assert_chunk_count(document_index, chunk_request, filters, expected_count):
    chunks = document_index.id_based_retrieval(chunk_requests=[chunk_request], filters=filters)
    assert len(chunks) == expected_count
```

This pattern (or an equivalent simple polling loop) applies to the time.sleep(1) at line 342 as well.

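The "equivalent simple polling loop" mentioned in the comment above could look like this; `fetch_chunk_count` is a hypothetical callable standing in for the real `id_based_retrieval` call:

```python
# A plain polling loop, equivalent to the tenacity sketch above, for
# waiting until an index write becomes queryable. fetch_chunk_count is a
# hypothetical stand-in for a call like id_based_retrieval + len().
import time


def wait_for_chunk_count(fetch_chunk_count, expected: int,
                         timeout: float = 10.0, interval: float = 0.5) -> int:
    deadline = time.monotonic() + timeout
    while True:
        count = fetch_chunk_count()
        if count == expected:
            return count  # index has converged
        if time.monotonic() >= deadline:
            raise AssertionError(f"expected {expected} chunks, got {count}")
        time.sleep(interval)
```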

Contributor Author

i know man im tired, just read the comment above in the code about this.

@acaprau acaprau added this pull request to the merge queue Mar 10, 2026
Merged via the queue into main with commit 17551a9 Mar 10, 2026
94 checks passed
@acaprau acaprau deleted the andrei/260226/3/opensearch/fix-update-projects-and-personas-empty branch March 10, 2026 21:56
acaprau added a commit that referenced this pull request Mar 11, 2026